Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make VCFHeader not throw exception if contig header lines lack length field #1418

Merged

Conversation

cwhelan
Copy link
Contributor

@cwhelan cwhelan commented Sep 19, 2019

Description

VCFHeader.getSequenceDictionary currently throws an exception if one or more of the Contig header lines does not have a length field, despite the fact that:

  • The VCF spec (v 4.2 and 4.3) does not require a length field (although it states that they are typically present)
  • If contig lines are not present in the header it simply returns null

This generally has the effect of making htsjdk tools unable to process files missing length fields without re-headering, which is not always feasible, for example in the case of large multi-sample files in cloud storage.

Fixes #389

Things to think about before submitting:

  • Make sure your changes compile and new tests pass locally.
  • Add new tests or update existing ones:
    • A bug fix should include a test that previously would have failed and passes now.
    • New features should come with new tests that exercise and validate the new functionality.
  • Extended the README / documentation, if necessary
  • Check your code style.
  • Write a clear commit title and message
    • The commit message should describe what changed and is targeted at htsjdk developers
    • Breaking changes should be mentioned in the commit message.

@codecov-io
Copy link

codecov-io commented Sep 19, 2019

Codecov Report

Merging #1418 into master will increase coverage by 0.012%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##              master     #1418       +/-   ##
===============================================
+ Coverage     68.122%   68.134%   +0.012%     
- Complexity      8382      8384        +2     
===============================================
  Files            573       573               
  Lines          33989     33992        +3     
  Branches        5668      5668               
===============================================
+ Hits           23154     23160        +6     
+ Misses          8645      8644        -1     
+ Partials        2190      2188        -2
Impacted Files Coverage Δ Complexity Δ
src/main/java/htsjdk/variant/vcf/VCFHeader.java 90.116% <100%> (+0.058%) 76 <0> (ø) ⬇️
...n/java/htsjdk/variant/vcf/VCFContigHeaderLine.java 80.556% <100%> (+4.085%) 11 <0> (+1) ⬆️
...htsjdk/samtools/util/nio/DeleteOnExitPathHook.java 90.476% <0%> (+9.524%) 4% <0%> (+1%) ⬆️

@yfarjoun yfarjoun added the Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond label Sep 30, 2019
Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative solutions suggested. Back to @cwhelan .

public SAMSequenceRecord getSAMSequenceRecord() {
final String lengthString = this.getGenericFieldValue("length");
if (lengthString == null) throw new TribbleException("Contig " + this.getID() + " does not have a length field.");
if (lengthString == null) return null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SAMSequenceRecord has an UNKNOWN_SEQUENCE_LENGTH value specifically for the case where the length is missing. It would be preferable to use that here and return a SAMSequenceRecord, rather than returning null in cases where it previously would have thrown.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth adding a comment noting the SAMSequenceRecord.isSameSequence treats a record created this way (with a length of UNKNOWN_SEQUENCE_LENGTH) as matching another sequence record with the same name but a valid length.

@cwhelan
Copy link
Contributor Author

cwhelan commented Oct 1, 2019

Thanks for pointing me to SAMSequenceRecord.UNKNOWN_SEQUENCE_LENGTH @cmnbroad . I've changed this to return a SAMSequenceRecord with length set to that.

@@ -270,7 +270,9 @@ public SAMSequenceDictionary getSequenceDictionary() {

final List<SAMSequenceRecord> sequenceRecords = new ArrayList<SAMSequenceRecord>(contigHeaderLines.size());
for (final VCFContigHeaderLine contigHeaderLine : contigHeaderLines) {
sequenceRecords.add(contigHeaderLine.getSAMSequenceRecord());
final SAMSequenceRecord samSequenceRecord = contigHeaderLine.getSAMSequenceRecord();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you end up with nulls here? Shouldn't it be returning lines with UNKNOWN_SEQUENCE_LENGTH?

@@ -261,7 +261,7 @@ public void addMetaDataLine(final VCFHeaderLine headerLine) {

/**
* Returns the contigs in this VCF file as a SAMSequenceDictionary. Returns null if contigs lines are
* not present in the header. Throws SAMException if one or more contig lines do not have length
* not present in the header. Returns null if one or more contig lines do not have length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment is true.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cwhelan I think you forgot to update VCFHeader when you changed the behavior to return UNKNOWN_LENGTH_SEQUENCE

@lbergelson
Copy link
Member

Looks good otherwise.

@cwhelan cwhelan force-pushed the cw_contig_header_line_without_length branch from 59679d8 to ef6f244 Compare October 7, 2019 14:24
@cwhelan
Copy link
Contributor Author

cwhelan commented Oct 7, 2019

Sorry for the oversight @lbergelson. Pushed a new commit to VCFHeader to bring it up to date with the change to using UNKNOWN_LENGTH_SEQUENCE.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@lbergelson lbergelson dismissed cmnbroad’s stale review October 7, 2019 15:06

Review comments were addressed

@lbergelson lbergelson merged commit 4d73aff into samtools:master Oct 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond
Projects
None yet
Development

Successfully merging this pull request may close these issues.

VCFContigHeaderLine throwing Exception although the spec doesn't require it
5 participants